Abstract

This paper starts from the observation that multiple top performing pedestrian detectors can be modelled by using an intermediate layer filtering low-level features in combination with a boosted decision forest. Based on this observation we propose a unifying framework and experimentally explore different filter families. We report extensive results enabling a systematic analysis.

Using filtered channel features we obtain top performance on the challenging Caltech and KITTI datasets, while using only HOG+LUV as low-level features. When adding optical flow features we further improve detection quality and report the best known results on the Caltech dataset, reaching 93% recall at 1 FPPI.


1. Introduction 

Pedestrian detection is an active research area, with 1000+ papers published in the last decade¹ and well established benchmark datasets [9, 13]. It is considered a canonical case of object detection, and has served as playground to explore ideas that might be effective for generic object detection.

Although many different ideas have been explored, and detection quality has been steadily improving [2], arguably it is still unclear what the key ingredients for good pedestrian detection are; e.g. it remains unclear how effective parts, components, and feature learning are for this task.

Current top performing pedestrian detection methods all point to an intermediate layer (such as max-pooling or filtering) between the low-level feature maps and the classification layer [40, 43, 28, 24]. In this paper we explore the simplest such intermediary: a linear transformation implemented as convolution with a filter bank. We propose a framework for filtered channel features (see figure 1) that unifies multiple top performing methods [8, 1, 43, 24], and that enables a systematic exploration of different filter banks. With our experiments we show that, with the proper filter bank, filtered channel features reach top detection quality.

¹Papers from 2004 to 2014 with "pedestrian detection" in the title, according to Google Scholar.

Figure 1: Filtered feature channels illustration, for a single weak classifier reading over a single feature channel. Integral channel features detectors pool features via sums over rectangular regions [8, ]. We can equivalently re-write this operation as convolution with a filter bank followed by single pixel reads (see §2). We aim to answer: What is the effect of selecting different filter banks?

It has been shown that using extra information at test time (such as context, stereo images, optical flow, etc.) can boost detection quality. In this paper we focus on the “core” sliding window algorithm using solely HOG+LUV features (i.e. oriented gradient magnitude and colour features). We consider context information and optical flow as add-ons, included in the experiments section for the sake of completeness and comparison with existing methods. Using only HOG+LUV features we already reach top performance on the challenging Caltech and KITTI datasets, matching results using optical flow and significantly more features (such as LBP and covariance [40, 28]).
1.1. Related work

Recent survey papers discuss the diverse set of ideas explored for pedestrian detection [10, 14, 9, 2]. The most recent survey [2] indicates that the classifier choice (e.g. linear/non-linear SVM versus decision forest) is not a clear differentiator regarding quality; rather the features used seem more important.

Creativity regarding different types of features has not been lacking. HOG) The classic HOG descriptor is based on local image differences (plus pooling and normalization steps), and has been used directly [5], as input for a deformable parts model [11], or as features to be boosted [19, 25]. The integral channel features detector [8, 1] uses a simpler HOG variant with sum pooling and no normalizations. Many extensions of HOG have been proposed (e.g. [16, 11, 6, 33]). LBP) Instead of using the magnitude of local pixel differences, LBP uses the difference sign only as signal [39, 40, 28]. Colour) Although the appearance of pedestrians is diverse, the background and skin areas do exhibit a colour bias. Colour has been shown to be an effective feature for pedestrian detection and hence multiple colour spaces have been explored (both hand-crafted and learned) [8, 17, 18, 22]. Local structure) Instead of simple pixel values, some approaches try to encode a larger local structure based on colour similarities (soft-cue) [38, 15], segmentation methods (hard-decision) [26, 31, 35], or by estimating local boundaries [20]. Covariance) Another popular way to encode richer information is to compute the covariance amongst features (commonly colour, gradient, and oriented gradient) [36, 28]. Etc.) Other features include bag-of-words over colour, HOG, or LBP features [4]; learning sparse dictionary encoders [32]; and training features via a convolutional neural network [34]. Additional features specific for stereo depth or optical flow have been proposed, however we consider these beyond the focus of this paper. For our flow experiments we will use difference of frames from weakly stabilized videos (SDt) [29].

All the feature types listed above can be used in the integral channel features detector framework [8]. This family of detectors is an extension of the old ideas from Viola & Jones [37]. Sums of rectangular regions are used as input to decision trees trained via Adaboost. Both the regions to pool from and the thresholds in the decision trees are selected during training. The crucial difference from the pioneer work [37] is that the sums are done over feature channels other than simple image luminance.

Current top performing pedestrian detection methods (dominating INRIA [5], Caltech [9] and KITTI datasets [13]) are all extensions of the basic integral channel features detector (named ChnFtrs in [8], which uses only HOG+LUV features). SquaresChnFtrs [2], InformedHaar [43], and LDCF [24], are discussed in detail in section 2.2. Katamari exploits context and optical flow for improved performance. SpatialPooling(+) [28] adds max-pooling on top of sum-pooling, and uses additional features such as covariance, LBP, and optical flow. Similarly, Regionlets [40] also uses extended features and max-pooling, together with stronger weak classifiers and training a cascade of classifiers. Out of these, Regionlets is the only method that has also shown good performance on general classes datasets such as Pascal VOC and ImageNet.

In this paper we will show that vanilla HOG+LUV features have not yet saturated, and that, when properly used, they can reach top performance for pedestrian detection.

1.2. Contributions 

• We point out the link between ACF [7], (Squares)ChnFtrs [8, 1, 2], InformedHaar [43], and LDCF [24]. See section 2.

• We provide extensive experiments to enable a systematic analysis of the filtered integral channels, covering aspects not explored by related work. We report the summary of 65+ trained models (corresponding to ~10 days of single machine computation). See sections 4, 5 and 7.

• We show that top detection performance can be reached on Caltech and KITTI using HOG+LUV features only. We additionally report the best known results on Caltech. See section 7.

2. Filtered channel features 

Before entering the experimental section, let us describe our general architecture. Methods such as ChnFtrs [8], SquaresChnFtrs [1, 2] and ACF [7] all use the basic architecture depicted in figure 1 top part (best viewed in colours). The input image is transformed into a set of feature channels (also called feature maps); the feature vector is constructed by sum-pooling over a (large) set of rectangular regions. This feature vector is fed into a decision forest learned via Adaboost. The split nodes in the trees are a simple comparison between a feature value and a learned threshold. Commonly only a subset of the feature vector is used by the learned decision forest. Adaboost serves both for feature selection and for learning the thresholds in the split nodes.

A key observation, illustrated in figure 1 (bottom), is that such sum-pooling can be re-written as convolution with a filter bank (one filter per rectangular shape) followed by reading a single value of the convolution’s response map. This “filter + pick” view generalizes the integral channel features [8] detectors by allowing any filter bank to be used (instead of only rectangular shapes). We name this generalization “filtered channel features detectors”.
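
The equivalence stated above can be checked numerically. The following is a minimal sketch, assuming NumPy and a random array standing in for one feature channel; the function names `sum_pool` and `correlate_valid` are illustrative and do not come from any detector code base. For a uniform (all-ones) filter, cross-correlation and convolution coincide, so the sketch uses a plain sliding-window product.

```python
import numpy as np

def sum_pool(channel, top, left, height, width):
    """Sum over one rectangular region, as in integral channel features."""
    return channel[top:top + height, left:left + width].sum()

def correlate_valid(channel, kernel):
    """Dense 'valid' sliding-window response of a 2-D channel to a 2-D kernel."""
    kh, kw = kernel.shape
    out_h = channel.shape[0] - kh + 1
    out_w = channel.shape[1] - kw + 1
    response = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            response[y, x] = (channel[y:y + kh, x:x + kw] * kernel).sum()
    return response

rng = np.random.default_rng(0)
channel = rng.random((20, 10))   # stand-in for one feature channel
box = np.ones((4, 3))            # uniform rectangular filter

response = correlate_valid(channel, box)                      # "filter"
picked = response[5, 2]                                       # "pick"
pooled = sum_pool(channel, top=5, left=2, height=4, width=3)  # direct sum-pooling
print(np.isclose(pooled, picked))
```

Replacing `box` with any other kernel keeps the "pick" step unchanged, which is the generalization the text describes.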


In our framework, ACF [7] has a single filter in its bank, corresponding to a uniform 4x4 pixels pooling region. ChnFtrs [8] was a very large (tens of thousands) filter bank comprised of random rectangular shapes. SquaresChnFtrs [1, 2], on the other hand, was only 16 filters, each with a square-shaped uniform pooling region of different sizes. See figure 2a for an illustration of the SquaresChnFtrs filters; the upper-left filter corresponds to ACF’s one.

The InformedHaar [43] method can also be seen as a filtered channel features detector, where the filter bank (and read locations) are based on a human shape template (thus the “informed” naming). LDCF [24] is also a particular instance of this framework, where the filter bank consists of PCA bases of patches from the training dataset. In sections 4 and 5 we provide experiments revisiting some of the design decisions of these methods.

Note that all the methods mentioned above (and in the majority of experiments below) use only HOG+LUV feature channels² (10 channels total). Using linear filters and decision trees on top of these does not allow reconstructing the decision functions obtained when using LBP or covariance features (used by SpatialPooling and Regionlets). We thus consider the approach explored here orthogonal to adding such types of features.

2.1. Evaluation protocol 

For our experiments we use the Caltech [9, 2] and KITTI datasets [13]. The popular INRIA dataset is considered too small and too close to saturation to provide interesting results. All Caltech results are evaluated using the provided toolbox, and summarised by log-average miss-rate (MR, lower is better) in the [10^-2, 10^0] FPPI range for the “reasonable” setup. KITTI results are evaluated via the online evaluation portal, and summarised as average precision (AP, higher is better) for the “moderate” setup.
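
To make the log-average miss-rate summary concrete, here is a minimal sketch, assuming NumPy and a miss-rate curve given as two arrays sorted by increasing FPPI. The choice of nine log-spaced reference points and the handling of curves that start above a reference point are assumptions of this sketch, not details stated in the text; the provided toolbox remains the reference for reported numbers.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss-rate sampled at log-spaced FPPI values.

    fppi, miss_rate: curve samples sorted by increasing FPPI.
    num_points: number of reference points in [1e-2, 1e0] (assumed value).
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for ref in refs:
        below = np.nonzero(fppi <= ref)[0]
        # Assumption: if the curve has no point at or below this FPPI,
        # count the detector as missing everything there.
        sampled.append(miss_rate[below[-1]] if below.size else 1.0)
    sampled = np.maximum(np.array(sampled), 1e-10)  # guard against log(0)
    return float(np.exp(np.mean(np.log(sampled))))

# A flat curve has a log-average equal to its constant miss-rate.
flat = log_average_miss_rate([0.001, 0.1, 1.0, 10.0], [0.3, 0.3, 0.3, 0.3])
print(round(flat, 6))
```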

Caltech10x The raw Caltech dataset consists of videos (acquired at 30 Hz) with every frame annotated. The standard training and evaluation considers one out of each 30 frames (1 631 pedestrians over 4 250 frames in training, 1 014 pedestrians over 4 024 frames in testing). In our experiments of section 5 we will also consider a 10x increased training set where every 3rd frame is used (linear growth in pedestrians and images). We name this extended training set “Caltech10x”. LDCF [24] uses a similar extended set for training its model (every 4th frame).

Flow Methods using optical flow do not only use additional neighbour frames during training (1 to 4 depending on the method), but they also do so at test time. Because they have access to additional information at test time, we consider them as a separate group in our results section.

²We use “raw” HOG, without any clamping, cell normalization, block normalization, or dimensionality reduction.


Validation set In order to explore the design space of our pedestrian detector we set up a Caltech validation set by splitting the six training videos into five for training and one for testing (one of the splits suggested in [9]). Most of our experiments use this validation setup. We also report (a posteriori) our key results on the standard test set for comparison to the state of the art.

For the KITTI experiments we also validate some design choices (such as search range and number of scales) before submission on the evaluation server. There we use a 2/3 + 1/3 validation setup.

2.2. Baselines 

ACF Our experiments are based on the open source release of ACF [7]. Our first baseline is vanilla ACF re-trained on the standard Caltech set (not Caltech10x). On the Caltech test set it obtains 32.6% MR (50.2% MR on validation set). Note that this baseline already improves over more than 50 previously published methods [2] on this dataset. There is also a large gap between ACF-Ours (32.6% MR) and the original number from ACF-Caltech (44.2% MR [7]). The improvement is mainly due to the change towards a larger model size (from 30x60 pixels to 60x120). All parameter details are described in section 2.3, and kept identical across experiments unless explicitly stated.

InformedHaar Our second baseline is a re-implementation of InformedHaar [43]. Here again we observe an important gain from using a larger model size (same change as for ACF). While the original InformedHaar paper reports 34.6% MR, InformedHaar-Ours reaches 27.0% MR on the Caltech test set (39.3% MR on validation set).

For both our baselines we use exactly the same training set as the original papers. Note that the InformedHaar-Ours baseline (27.0% MR) is right away the best known result for a method trained on the standard Caltech training set. In section 3 we will discuss our re-implementation of LDCF [24].

2.3. Model parameters 

Unless otherwise specified we train all our models using the following parameters. Feature channels are HOG+LUV only. The final classifier includes 4096 level-2 decision trees (L2, 3 stumps per tree), trained via vanilla discrete Adaboost. Each tree is built by doing exhaustive greedy search for each node (no randomization). The model has size 60x120 pixels, and is built via four rounds of hard negative mining (starting from a model with 32 trees, and then 512, 1024, 2048, 4096 trees). Each round adds 10 000 additional negatives to the training set. The sliding window stride is 6 pixels (both during hard negative mining and at test time).
Figure 2: Illustration of the different filter banks considered. Except for SquaresChnFtrs filters, only a random subset of the full filter bank is shown. {Red, White, Green} indicate {-1, 0, +1}.

Compared to the default ACF parameters, we use a bigger model, more trees, more negative samples, and more boosting rounds. But we do use the same code-base and the same training set.

Starting from section 5 we will also consider results with the Caltech10x data; there we use level-4 decision trees (L4), and Realboost [12] instead of discrete Adaboost. All other parameters are left unchanged.

3. Filter bank families 

Given the general architecture and the baselines described in section 2, we now proceed to explore different types of filter banks. Some of them are designed using prior knowledge and they do not change when applied across datasets; others exploit data-driven techniques for learning their filters. Sections 4 and 5 will compare their detection quality.


InformedFilters Starting from the InformedHaar [43] baseline we use the same “informed” filters but leave free the positions where they are applied (instead of fixed as in InformedHaar); these are selected during the boosting learning. Our initial experiments show that removing the position constraint has a small (positive) effect. Additionally we observe that the original InformedHaar filters do not include simple square pooling regions (à la SquaresChnFtrs), we thus add these too. We end up with 212 filters in total, to be applied over each of the 10 feature channels. This is equivalent to training decision trees over 2120 (non filtered) channel features.

As illustrated in figure 2d the InformedFilters have different sizes, from 1x1 to 4x3 cells (1 cell = 6x6 pixels), and each cell takes a value in {-1, 0, +1}. These filters are applied with a step size of 6 pixels. For a model of 60x120 pixels this results in 200 features per channel, 2 120 · 200 = 424 000 features in total³. In practice, considering border effects (large filters are not applied on the border of the model to avoid reading outside it), we end up with ~300 000 features. When training 4 096 level-2 decision trees, at most 4 096 · 3 = 12 288 features will be used, that is ~3% of the total. In this scenario (and all others considered in this paper) Adaboost has a strong role of feature selection.
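
The counts in the paragraph above follow from a few multiplications. A short sketch of the arithmetic, with variable names chosen here for illustration only and border effects ignored:

```python
# Quantities quoted in the text.
num_filters = 212          # InformedFilters bank size
num_channels = 10          # HOG+LUV feature channels
model_width, model_height = 60, 120   # model size in pixels
stride = 6                 # filter step size in pixels
num_trees = 4096
splits_per_tree = 3        # a level-2 tree has 3 split nodes

filtered_channels = num_filters * num_channels                              # 2120
positions_per_channel = (model_width // stride) * (model_height // stride)  # 10 * 20
total_features = filtered_channels * positions_per_channel
max_features_used = num_trees * splits_per_tree
fraction_used = max_features_used / total_features

print(filtered_channels, positions_per_channel, total_features)
print(max_features_used, f"{fraction_used:.1%}")
```

The fraction is computed against the 424 000 figure; against the ~300 000 features left after border effects it would be somewhat larger.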

Checkerboards As seen in section 2.2 InformedHaar is a strong baseline. It is however unclear how much the “informed” design of the filters is effective compared to other possible choices. Checkerboards is a naive set of filters that covers the same sizes (in number of cells) as InformedHaar/InformedFilters and for each size defines (see figure 2b): a uniform square, all horizontal and vertical gradient detectors (±1 values), and all possible checkerboard patterns. These configurations are comparable to InformedFilters but do not use the human shape as prior.

The total number of filters is a direct function of the maximum size selected. For up to 4x4 cells we end up with 61 filters, up to 4x3 cells 39 filters, up to 3x3 cells 25 filters, and up to 2x2 cells 7 filters.

RandomFilters Our next step towards removing a hand-crafted design is simply using random filters (see figure 2c). Given a desired number of filters and a maximum filter size (in cells), we sample the filter size with uniform distribution, and set its cell values to ±1 with uniform probability. We also experimented with values {-1, 0, +1} and observed a (small) quality decrease compared to the binary option.
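
The sampling procedure described above can be sketched as follows, assuming NumPy. Drawing height and width independently, the 6-pixel cell size (taken from the InformedFilters description), and the function name `sample_random_filters` are assumptions of this sketch rather than details confirmed for the RandomFilters experiments.

```python
import numpy as np

def sample_random_filters(num_filters, max_cells, cell_size=6, seed=0):
    """Sample binary (+1/-1) filters with sizes drawn uniformly in cells.

    Each filter is returned at pixel resolution: every cell value is
    repeated over a cell_size x cell_size block.
    """
    rng = np.random.default_rng(seed)
    filters = []
    for _ in range(num_filters):
        height = rng.integers(1, max_cells + 1)   # uniform in 1..max_cells
        width = rng.integers(1, max_cells + 1)
        cells = rng.choice([-1, 1], size=(height, width))
        block = np.ones((cell_size, cell_size), dtype=int)
        filters.append(np.kron(cells, block))     # expand cells to pixels
    return filters

bank = sample_random_filters(num_filters=50, max_cells=4)
print(len(bank), bank[0].shape)
```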


³“Feature channel” refers to the output of the first transformation in figure 1 bottom. “Filters” are the convolutional operators applied to the feature channels. And “features” are entries in the response maps of all filters applied over all channels. A subset of these features are the input to the learned decision forest.

Figure 3: Detection quality (log-average miss-rate MR, lower is better) versus number of filters used. All models trained and tested on the Caltech validation set (see §4).


The design of the filters considered above completely ignores the available training data. In the following, we consider additional filters learned from data.

LDCF [24] The work on PCANet [3] showed that applying arbitrary non-linearities on top of PCA projections of image patches can be surprisingly effective for image classification. Following this intuition LDCF [24] uses learned PCA eigenvectors as filters (see figure 2e).

We present a re-implementation of [24] based on ACF’s [7] source code. We try to follow the original description as closely as possible. We use the same top 4 filters of 10x10 pixels, selected per feature channel based on their eigenvalues (40 filters total). We do change some parameters to be consistent amongst all experiments, see sections 2.3 and 5. The main changes are the training set (we use Caltech10x, sampled every 3 frames, instead of every 4 frames in [24]), and the model size (60x120 pixels instead of 32x64). As will be shown in section 7, our implementation (LDCF-Ours) clearly improves over the previously published numbers [24], showing the potential of the method.

For comparison with PcaForeground we also consider training LDCF8 where the top 8 filters are selected per channel (80 filters total).
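
As an illustration of learning filters as PCA eigenvectors of patches, the following minimal sketch assumes NumPy and patches already extracted from a single feature channel. Mean-centring the patches and the function name `pca_filters` are assumptions of this sketch; it is not the LDCF code, and details such as patch sampling or pre-smoothing are left out.

```python
import numpy as np

def pca_filters(patches, num_filters=4):
    """Return the top PCA eigenvectors of a patch set, reshaped as filters.

    patches: array of shape (n, h, w) from one feature channel.
    Returns (filters, eigenvalues), filters of shape (num_filters, h, w),
    ordered by decreasing eigenvalue.
    """
    n, h, w = patches.shape
    flat = patches.reshape(n, h * w).astype(float)
    flat -= flat.mean(axis=0)                  # centre the data (assumption)
    cov = flat.T @ flat / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_filters]
    filters = eigvecs[:, order].T.reshape(num_filters, h, w)
    return filters, eigvals[order]

rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 10, 10))       # stand-in for 10x10 pixel patches
filters, eigvals = pca_filters(patches, num_filters=4)
print(filters.shape)
```

Applying this once per channel with `num_filters=4` would give the 40-filter configuration described in the text, and `num_filters=8` the LDCF8 variant.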

PcaForeground In LDCF the filters are learned using all of the training data available. In practice this means that the learned filters will be dominated by background information, and will have minimal information about the pedestrians. Put differently, learning filters from all the data assumes that the decision boundary is defined by a single distribution (like in Linear Discriminant Analysis [23]), while we might want to define it based on the relation between the background distribution and the foreground distribution (like Fisher’s Discriminant Analysis [23]). In PcaForeground we train 8 filters per feature channel, 4 learned from background image patches, and 4 learned from patches extracted over pedestrians (see figure 2f). Compared to LDCF8 the obtained filters are similar but not identical; all other parameters are kept identical.

Other than via PcaForeground/LDCF8, it is not clear how to further increase the number of filters used in LDCF. Past 8 filters per channel, the eigenvalues decrease to negligible values and the eigenvectors become essentially random (similar to RandomFilters).

To keep the filtered channel features setup close to InformedHaar, the filters are applied with a step of 6 pixels. However, to stay close to the original LDCF, the LDCF/PcaForeground filters are evaluated every 2 pixels. Although (for example) LDCF8 uses only ~10% of the number of filters per channel compared to Checkerboards4x4, due to the step size increase, the obtained feature vector size is ~40%.

4. How many filters? 

Given a fixed set of channel features, a larger filter bank provides a richer view over the data compared to a smaller one. With enough training data one would expect larger filter banks to perform best. We thus want to analyze the trade-off between number of filters and detection quality, as well as which filter bank family performs best.

Figure 3 presents the results of our initial experiments on the Caltech validation set. It shows detection quality versus number of filters per channel. This figure densely summarizes ~30 trained models.

InformedFilters The first aspect to notice is that there is a meaningful gap between InformedHaar-Ours and InformedFilters despite having a similar number of filters (209 versus 212). This validates the importance of letting Adaboost choose the pooling locations instead of hand-crafting them. Keep in mind that InformedHaar-Ours is a top performing baseline (see §2.2).

Secondly, we observe that (for the fixed training data available) ~50 filters is better than ~200. Below 50 filters the performance degrades for all methods (as expected).

To change the number of filters in InformedFilters we train a full model (212 filters), pick the N most frequently used filters (selected from node splitting in the decision forest), and use these to train the desired reduced model. We can select the most frequent filters across channels or per channel (marked as Inf.FiltersPerChannel). We observe that per channel selection is slightly worse than across channels, thus we stick to the latter.
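
The across-channel selection described above amounts to a frequency count over the split nodes of the trained forest. A minimal sketch, assuming the forest has been flattened into a list with one filter index per split node; this input format and the name `most_used_filters` are hypothetical and used only for illustration.

```python
from collections import Counter

def most_used_filters(split_filter_ids, n):
    """Return the n filter indices used most often by the split nodes.

    split_filter_ids: one filter index per split node of the trained forest
    (hypothetical flattened representation). Ties keep first-seen order.
    """
    counts = Counter(split_filter_ids)
    return [filter_id for filter_id, _ in counts.most_common(n)]

# Toy example: filter 3 is used three times, filter 7 twice.
print(most_used_filters([3, 3, 7, 1, 7, 3, 9], n=2))
```

A per-channel variant would keep one such counter per feature channel and select the top entries of each.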

Using the most frequently used filters for selection is clearly a crude strategy since frequent usage does not guarantee discriminative power, and it ignores relation amongst filters. We find this strategy good enough to convey the main points of this work.

Training     Method          L2     L3     L4     L5
Caltech      ACF             50.2   42.1   48.8   48.7
Caltech10x   ACF             52.6   49.9   44.9   41.3
Caltech      Checkerboards   32.9   30.4   28.0   31.5
Caltech10x   Checkerboards   37.0   31.6   24.7   24.7

Table 1: Effect of the training volume and decision tree depth (Ln) over the detection quality (average miss-rate on validation set, lower is better), for ACF-Ours and the Checkerboards variant with 61 filters (4x4 cells). We observe a similar trend for other filter banks.

Checkerboards also reaches best results in the ~50 filters region. Here the number of filters is varied by changing the maximum filter size (in number of cells). Regarding the lowest miss-rate there is no large gap between the “informed” filters and this naive baseline.

RandomFilters The hexagonal dots and their deviation bars indicate the mean, maximum and minimum miss-rate obtained out of five random runs. When using a larger number of filters (50) we observe a lower (better) mean but a larger variance compared to when using fewer filters (15). Here again the gap between the best random run and the best result of other methods is not large.

Given a set of five models, we select the N most frequently used filters and train new reduced models; these are shown in the RandomFilters line. Overall the random filters are surprisingly close to the other filter families. This indicates that expanding the feature channels via filtering is the key step for improving detection quality, while selecting the “perfect” filters is a secondary concern.

LDCF/PcaForeground In contrast to the other filter bank families, LDCF under-performs when increasing the number of filters (from 4 to 8) while using the standard Caltech training set (consistent with the observations in [24]). PcaForeground improves marginally over LDCF8.

Takeaways From figure 3 we observe two overall trends. First, the more filters the merrier, with ~50 filters as the sweet spot for Caltech training data. Second, there is no flagrant difference between the different filter types.

5. Additional training data 

One caveat of the previous experiments is that as we increase the number of filters used, so does the number of features Adaboost must pick from. Since we increased the model capacity (compared to ACF which uses a single filter), we consider using the Caltech10x dataset (§2.1) to verify that our models are not starving for data. Similar to the experiments in [24], we also reconsider the decision tree depth, since additional training data enables bigger models.


Aspect             MR     ΔMR
ACF-Ours           50.8   -
+ filters          32.9   +17.9
+ L4               28.0   +4.9
+ Caltech10x       24.7   +3.3
+ Realboost        24.4   +0.3
Checkerboards4x4   24.4   +26.4

Table 2: Ingredients to build our strong detectors (using Checkerboards4x4 in this example, 61 filters). Validation set log-average miss-rate (MR).

Results for two representative methods are collected in table 1. First we observe that already with the original training data, deeper trees do provide significant improvement over level-2 (which was selected when tuning over INRIA data [8, 1]). Second, we notice that increasing the training data volume provides the expected improvement only when the decision trees are deep enough. For our following experiments we choose to use level-4 decision trees (L4) as a good balance between increased detection quality and reasonable training times.

Realboost Although previous papers on ChnFtrs detectors reported that different boosting variants all obtain equal results on this task [8, 1], the recent [24] indicated that Realboost has an edge over discrete Adaboost when additional training data is used. We observe the same behaviour in our Caltech10x setup.

As summarized in table 2, using filtered channels, deeper trees, additional training data, and Realboost does provide a significant detection quality boost. For the rest of the paper our models trained on Caltech10x all use level-4 trees and Realboost, instead of level-2 and discrete Adaboost for the Caltech 1x models.

Timing When using Caltech data ACF takes about one hour for training and one for testing. Checkerboards4x4 takes about 4 and 2 hours respectively. When using Caltech10x the training times for these methods increase to 2 and 29 hours, respectively. The training time does not increase proportionally with the training data volume because the hard negative mining reads a variable number of images to attain the desired quota of negative samples. This number increases when a detector makes fewer false positive mistakes.

5.1. Validation set experiments 

Based on the results in table 2 we proceed to evaluate on Caltech10x the most promising configurations (filter type and number) from section 4. The results over the Caltech validation set are collected in table 3. We observe a clear overall gain from increasing the training data.

Interestingly with enough RandomFilters we can outperform the strong performance of LDCF-Ours. We also notice that the naive Checkerboards outperforms the manual design of InformedFilters.

Filters type      # filters   Caltech MR   Caltech10x MR   ΔMR
ACF-Ours          1           50.2         39.8            10.4
LDCF-Ours         4           37.3         34.1            3.2
LDCF8             8           42.6         30.7            11.9
PcaForeground     8           41.6         28.6            13.0
RandomFilters     50          36.5         28.2            8.3
InformedFilters   50          30.3         26.6            3.7
Checkerboards     39          30.9         25.9            5.0
Checkerboards     61          32.9         24.4            8.5

Table 3: Effect of increasing the training set for different methods, quality measured on Caltech validation set (MR: log-average miss-rate).



[Plot: x-axis is false positives per image; legend entry: 17.10% Ours-All-in-one.]

Figure 4: Some of the top quality detection methods for Caltech-USA.


6. Add-ons 

Before presenting the final test set results of our “core” method (section 7), we also consider some possible “add-ons” based on the suggestions from [2]. For the sake of evaluating complementarity, comparison with existing methods, and reporting the best possible detection quality, we consider extending our detector with context and optical flow information.

Context Context is modelled via the 2Ped re-scoring method of [27]. It is a post-processing step that merges our detection scores with the results of a two person DPM [ 1 ] trained on the INRIA dataset (with extended annotations). In [27] the authors reported an improvement of ~5 pp (percent points) on the Caltech set, across different methods. In [2] an improvement of 2.8 pp is reported over their strong detector (SquaresChnFtrs+DCT+SDt 25.2% MR). In our experiments however we obtain a gain below 0.5 pp. We have also investigated fusing the 2Ped detection results via a different, more principled, fusion method [41]. We observe consistent results: as the strength of the starting point increases, the gain from 2Ped decreases. When reaching our Checkerboards results, all gains have evaporated. We believe that the 2Ped approach is a promising one, but our experiments indicate that the used DPM template is simply too weak in comparison to our filtered channels.

Optical flow Optical flow is fed to our detector as an additional set of 2 channels (not filtered). We use the implementation from SDt [29] which uses differences of weakly stabilized video frames. On Caltech, the authors of [29] reported a ~7 pp gain over ACF (44.2% MR), while [2] reported a ~5 pp improvement over their strong baseline (SquaresChnFtrs+DCT+2Ped 27.4% MR). When using +SDt our results are directly comparable to Katamari [2] and SpatialPooling+ [28] which both use optical flow too.


Using our stronger Checkerboards results SDt provides a 1.4 pp gain. Here again we observe an erosion as the starting point improves (for confirmation, we reproduced the ACF+SDt results [29], 43.9% → 33.9% MR). We name our Checkerboards+SDt detector All-in-one.

Our filtered channel features results are strong enough to erode existing context and flow features. Although these remain complementary cues, more sophisticated ways of extracting this information will be required to further progress in detection quality.

It should be noted that despite our best efforts we could not reproduce the results from either 2Ped or SDt on the KITTI dataset (in spite of its apparent similarity to Caltech). Effective methods for context and optical flow across datasets have yet to be shown. Our main contribution remains on the core detector (only HOG+LUV features over local sliding window pixels in a single frame).

7. Test set results 

Having explored the parameter space
on the validation set, we now evaluate the most promising
methods on the Caltech and KITTI test sets.

Caltech test set. Figures 5 and 4 present our key results
on the Caltech test set. For proper comparison, only methods
using the same training set should be compared (see
[2, figure 3] for a similar table comparing 50+ previous
methods). We include for comparison the baselines mentioned
in section 2.2, Roerei [1] the best known method
trained without any Caltech images, MT-DPM [42] the
best known method based on DPM, and SDN [21] the
best known method using convolutional neural networks.
We also include the top performers Katamari [2] and
SpatialPooling+ [28]. We mark as "CaltechN×"
both the Caltech10x training set and the one used in LDCF
[24] (see section 5).





















Detection quality on Caltech test set
(log-average miss-rate, lower is better)

  Method            MR
  Roerei            48.4%
  ACF-Caltech       44.2%
  MT-DPM            40.5%
  SDN               37.9%
  ACF+SDt           37.3%
  SquaresChnFtrs    34.8%
  InformedHaar      34.6%
  ACF-Ours          32.6%
  SpatialPooling    29.2%
  Inf.Haar-Ours     27.0%
  LDCF              24.8%
  Katamari          22.5%
  SpatialPooling+   21.9%
  LDCF-Ours         21.4%
  InformedFilters   18.7%
  RandomFilters     18.5%
  Checkerboards     18.5%
  All-in-one        17.1%

Figure 5: Some of the top quality detection methods for
Caltech test set (see text), and our results (highlighted with
white hatch). Methods using optical flow are trained on
original Caltech except our All-in-one which uses Caltech10x.
CaltechN× indicates Caltech10x for all methods
but the original LDCF (see section 2.1).
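
The text does not spell out how the log-average miss-rate of figure 5 is computed. As a minimal sketch, assuming the usual Caltech benchmark convention (geometric mean of the miss-rate sampled at nine FPPI values evenly spaced in log-space between 10^-2 and 10^0), it could be written as below. The function name and the input layout (a curve sorted by increasing FPPI) are illustrative assumptions, not the evaluation code used for the reported numbers.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss-rate at FPPI references in [1e-2, 1e0].

    Assumes `fppi` is sorted in increasing order and `miss_rate` holds the
    corresponding miss-rates (hypothetical input layout).
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, num_points)
    samples = []
    for ref in refs:
        # Operating point with the largest FPPI that does not exceed the reference.
        idx = np.where(fppi <= ref)[0]
        # If the curve never reaches such a low FPPI, count a miss-rate of 1.
        samples.append(miss_rate[idx[-1]] if idx.size else 1.0)
    # Guard against log(0) for perfect operating points.
    samples = np.maximum(np.array(samples), 1e-10)
    return float(np.exp(np.mean(np.log(samples))))
```

Averaging in the log domain keeps the low-FPPI part of the curve from being dominated by the high-FPPI part.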

KITTI test set. Figure 6 presents the results on the KITTI
test set ("moderate" setup), together with all other reported
methods using only monocular image content (no stereo or
LIDAR data). The KITTI evaluation server has only recently
started receiving submissions (14 for this task, 11 in the
last year), and thus is less prone to dataset over-fitting.

We train our model on the KITTI training set using almost
identical parameters as for Caltech. The only change is a
subtle pre-processing step in the HOG+LUV computation.
On KITTI the input image is smoothed (radius 1 pixel) before
the feature channels are computed, while on Caltech
it is not. This subtle change provided a ~4 pp (percent
points) improvement on the KITTI validation set.
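
The smoothing kernel is not specified beyond its 1 pixel radius. The following is a minimal sketch of such a pre-processing step, assuming a separable [1, 2, 1]/4 triangle filter with edge replication; both the kernel choice and the function name are assumptions made for illustration.

```python
import numpy as np

def smooth_radius1(image):
    """Separable [1, 2, 1] / 4 smoothing with replicated borders.

    Works on (H, W) or (H, W, C) arrays; the kernel is an assumed stand-in
    for the "radius 1 pixel" smoothing mentioned in the text.
    """
    img = np.asarray(image, dtype=np.float64)
    # Pad only the two spatial axes so colour images keep their channels.
    pad = [(1, 1), (1, 1)] + [(0, 0)] * (img.ndim - 2)
    p = np.pad(img, pad, mode="edge")
    # Vertical pass over rows.
    v = (p[:-2] + 2.0 * p[1:-1] + p[2:]) / 4.0
    # Horizontal pass over columns.
    return (v[:, :-2] + 2.0 * v[:, 1:-1] + v[:, 2:]) / 4.0
```

The smoothed image would then be passed to the HOG+LUV channel computation in place of the raw input.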

7.1. Analysis 

With a ~10 pp (percent points) gap between ACF/InformedHaar
and ACF/InformedHaar-Ours (see figure 5),
the results of our baselines show the importance
of proper validation of training parameters (large enough
model size and negative samples). InformedHaar-Ours
is the best reported result when training with
Caltech1x.

When considering methods trained on Caltech10x, we
obtain a clear gap with the previous best results (LDCF
24.8% MR → Checkerboards 18.5% MR). Using
our architecture and an adequate number of filters one
can obtain strong results using only HOG+LUV features.
The exact type of filters seems not critical; in our experiments
Checkerboards4x3 gets the best performance given
the available training data. RandomFilters reaches the
same result, but requires training and merging multiple
models.

Figure 6: Pedestrian detection on the KITTI dataset (using
images only). [Plot: KITTI Pedestrians, moderate difficulty; horizontal axis: Recall.]

Our results halve the miss-rate of the best known
convnet for pedestrian detection (SDN [21]), which in
principle could learn similar low-level features and their
filtering.

When adding optical flow we further push the state of
the art and reach 17.1% MR, a comfortable ~5 pp improvement
over the previous best optical flow method
(SpatialPooling+). This is the best reported result on this
challenging dataset.
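
The gaps quoted in this analysis can be re-derived from the miss-rates listed in figure 5. The short sketch below does only that arithmetic; the dictionary and helper names are illustrative.

```python
# Log-average miss-rates (in %) as listed in figure 5.
miss_rate = {
    "SDN": 37.9,
    "LDCF": 24.8,
    "SpatialPooling+": 21.9,
    "Checkerboards": 18.5,
    "All-in-one": 17.1,
}

def gain_pp(baseline, method):
    """Improvement of `method` over `baseline`, in percent points."""
    return round(miss_rate[baseline] - miss_rate[method], 1)

def relative_miss_rate(baseline, method):
    """Miss-rate of `method` as a fraction of the `baseline` miss-rate."""
    return miss_rate[method] / miss_rate[baseline]

print(gain_pp("LDCF", "Checkerboards"))             # gap to previous best on Caltech10x
print(gain_pp("SpatialPooling+", "All-in-one"))     # gap to previous best flow method
print(relative_miss_rate("SDN", "Checkerboards"))   # fraction of the SDN miss-rate
```

The first two values are the 6.3 pp and 4.8 pp (~5 pp) gaps, and the last is slightly below 0.5, in line with the halving of the SDN miss-rate.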

The results on the KITTI dataset confirm the strength
of our approach, reaching 54.0% AP, just 1 pp below
the best known result on this dataset. Competing methods
(Regionlets [40] and SpatialPooling [28])
both use HOG together with additional LBP and covariance
features. Adding these remains a possibility for our system.
Note that our results also improve over methods using
LIDAR + Image, such as Fusion-DPM [30] (46.7% AP,
not included in figure 6 for clarity).

8. Conclusion 

Through this paper we have shown that the seemingly
disconnected methods ACF, (Squares)ChnFtrs,
InformedHaar, and LDCF can all be put under the
filtered channel features detectors umbrella. We have
systematically explored different filter banks for such an
architecture and shown that they provide means for important
improvements for pedestrian detection. Our results indicate
that HOG+LUV features have not yet saturated, and that
competitive results (over Caltech and KITTI datasets) can
be obtained using only them. When optical flow information
is added we set the new state of the art for the Caltech
dataset, reaching 17.1% MR (93% recall at 1 false positive
per image).

In future work we plan to explore how the insights of
this work can be exploited in a more general detection
architecture such as convolutional neural networks.

A. Learned model

In figures 7 and 8 we present some qualitative aspects
of the final learned models Checkerboards4x3 and
RandomFilters (see results section of main paper), not
included in the main submission due to space limitations.

In figure 7 we compare the spatial distribution of our
models versus a significantly weaker model (Roerei,
trained on INRIA, see figure 5 of main paper). We observe
that our strong models focus on areas similar to those of
the weak Roerei model. This indicates that using filtered
channels does not change which areas of the pedestrian are
informative, but rather that at the same locations filtered
channels are able to extract more discriminative information.

In all three models we observe that diagonally oriented
channels focus on left and right shoulders. The U colour
channel is mainly used around the face, while L (luminance)
and gradient magnitude (‖·‖) channels are used all over the
body. Overall, head, feet, and upper torso areas provide most
clues for detection.
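
The spatial maps discussed here count, for each location of the model window, how many decision trees use it. A minimal sketch of that counting is given below, under simplifying assumptions: each tree is represented as a list of hypothetical split tuples (channel, filter_id, y, x), and each split is attributed to a single cell rather than to the full support of its filter.

```python
import numpy as np

def spatial_usage(trees, height, width):
    """Per-cell count of trees with at least one split at that cell.

    `trees` is a list of trees, each a list of (channel, filter_id, y, x)
    split tuples (an illustrative layout, not the authors' model format).
    """
    heat = np.zeros((height, width), dtype=int)
    for tree in trees:
        # A tree contributes at most once per cell, however many splits it has there.
        cells = {(y, x) for (_channel, _filter_id, y, x) in tree}
        for y, x in cells:
            heat[y, x] += 1
    return heat
```

Restricting the tuples to one channel before counting would give the per-channel maps, while using all tuples gives the map across channels.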

In figure 8 we observe that the filter usage distribution
is similar across different filter bank families.


(c) Final RandomFilters model


Figure 7: Spatial distribution of learned models. Per channel on the left, and across channels on the right. Red areas
indicate the pixels that most influence the decision (used by more decision trees). Figures 7b and 7c show our learned models
(reaching ~18% MR on the Caltech test set); figure 7a shows a similar visualization for a weaker model (~46% MR). See text for
discussion.

(a) Filters used in our final Checkerboards4x3 model
(b) Filters used in our final RandomFilters model


Figure 8: Frequency of usage of each filter as feature for decision tree split node (independent of the feature channel). Left
and right we show the top-10 and bottom-10 most frequent filters respectively.

Uniform filters are clearly the most frequently used ones (they are also used in methods such as Roerei, ACF and
(Squares)ChnFtrs); there is no obvious ordering pattern in the remaining ones. Please note that each decision tree
will probably use multiple filters across multiple channels to reach its weak decision.
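
The usage statistic of figure 8 amounts to counting split nodes per filter, ignoring the channel. A minimal sketch follows, assuming each tree is a list of hypothetical (channel, filter_id) split tuples; the representation and function name are illustrative only.

```python
from collections import Counter

def filter_usage(trees, k=10):
    """Return the k most and k least frequently used filters.

    `trees` is a list of trees, each a list of (channel, filter_id) splits
    (an assumed layout). Filters that are never used do not appear.
    """
    counts = Counter(
        filter_id for tree in trees for (_channel, filter_id) in tree
    )
    ranked = counts.most_common()  # sorted by decreasing frequency
    return ranked[:k], ranked[-k:]
```

Because the channel index is discarded, a filter applied to several channels is counted once per split node that uses it.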






























































































